class: center, middle, inverse, title-slide .title[ # Survey Data Analysis with Kobocruncher ] .subtitle[ ## Session 6 - Cleaning and Indicator Calculation ] .author[ ###
Link to Documentation
–
Link to Previous Session
–
Link to Next Session
] .date[ ### Training Content as of 01 December 2022 ] --- ## When do yo need to clean the data? Survey data cleaning involves identifying and removing responses from individuals who either don’t match your target audience criteria or didn’t answer your questions thoughtfully. This filtering is done to avoid drawing misleading conclusions. Data cleaning remains a last resort option that can be at first minimized by: * .large[__Quality of questionnaire design__] not only to minimize social desirability and biased questions but also to ensure that the interview duration is limited (_ideally less than 45 minutes for a face to face interview and less than 25 minutes for a telephone interview_) * .large[__Good form encoding__] - with well defined [constraints](https://xlsform.org/en/#constraints) and [skip logic](https://xlsform.org/en/#relevant) and [requirement](https://xlsform.org/en/#required) to avoid Inconsistent Responses, sufficient testing to ensure that the questions are well understood and the responses options are covering well the options * .large[__Good training for the data enumerator__] and detailed [question hints](https://xlsform.org/en/#hints) so the enumerators fills correctly the questionnaire * .large[__Sufficient data collection quality monitoring__] to identify, prevent and cure issues early on. This can be done through [High Frequency Check](https://unhcr.github.io/HighFrequencyChecks/docs/). This should help to flag Straightlining / Patterned Responses when an enumerator is using the same answer option ("B") over and over (for instance for at least five rows in a grid)... .bg-blue[ For data quality, prevention is lot more effective, quicker and cheaper than curing. Take the time to thoroughly test the questionnaire before starting full on data collection. ] ??? https://dimewiki.worldbank.org/Checklist:_Data_Cleaning https://dimewiki.worldbank.org/Data_Cleaning --- ## Go through your initial exploration report to identify issues In order to guide this selection phase, data experts, in collaboration with the data analysis group, can use the following elements: * For numeric value, check the frequency distributions of each variable to average, deviation, including outlier and oddities * For categorical variables, check for unexpected values: any weird results based on common sense expectations * Use correlation analysis to check for potential contradictions in respondents answers to different questions for identified associations (chi-square) * Always, Check for missing data (NA) or "%of respondent who answered" that you cannot confidently explain * Check unanswered questions, that corresponds to unused skip logic in the questionnaire: For instance, did a person who was never displaced answer displacement-related questions? Were employment-related answers provided for a toddler? --- ## Situation when you will still need minimum cleaning a priori Whatever is quality of form design, enumerator training and data collection quality monitoring, there will be still cases where cleaning will involve removing entire records: * Remove from the dataset records where no consent were obtained and/or more broadly one a specific filter/condition (where the respondent do not meet certain criteria or data from an unreliable enumerator identified during data collection quality monitoring... * Remove duplicate respondent ID based on the original sample list * Remove from the dataset records before or at specific dates * Remove from the dataset records when interview duration appears as outliers, either too long or too short, aka "speed responses" --- ## How to set up filters on the data * Often survey includes nested tables (aka `repeat`), so if you remove records from the main table, you need to remove linked records in the nested tables --- ## Situation when you will still need minimum cleaning a posteriori In other cases, cleaning will involve `recoding` some variables 1. Recode un-explainable .large[__outliers for numerical questions__]. An example of this would be if you asked how much water one person use in a day and someone answered that they use 1000 liters, while the second largest usage reported is 150 liters. 2. Recode questions consecutive from .large[__"or other" choices__]. 3. Recode some questions answer as .large[__new calculated variables__] to have more balanced response categories based on frequency or potential closes meaning --- ## Outliers for numerical questions After reviewing the first exploration report, set up the maximum accepted standard deviation --- ## Recode categories, treat "or other" choices Often, categorical questions include an "or other" variables and this option might be mis-used In the exploratory report, those are plotted with a word cloud --- ## New calculated variables use case * filters on specific criteria * Create a filters on specific criteria * Ratio between 2 numeric variable * Calculation on date * Discretization of numeric variable according to quintile * Discretization of numeric variable according to fixed break * Aggregate variable from nested frame (aka within repeat) to parent table * filters on specific criteria --- ## kobo_indicator In kobocruncher, this is done with [kobo_indicator](https://edouard-legoupil.github.io/kobocruncher/reference/kobo_indicator.html) function. The function goes through steps: 1 - load the already defined indicators in the xlsform 2 - append new indicator supplied to the function if any, 3 - apply the indicator, i.e. do the calculation, 4 - re-save all the working indicator definition within the extended xlsform, dedicated indicator worksheet 5 - bind the new indicators in the dictionary for further plotting 6 - rebuild the plan if indicators are allocated to chapter, subchapter --- ## Create a filters on specific criteria ```r dplyr::if_else(datalist[1]$variable =="criteria", "yes","no") ``` --- ## Ratio between 2 numeric variable ```r datalist[1]$varnum1 / datalist[1]$varnum2 ``` --- ## Calculation on date ```r lubridate::interval( datalist[1]$datetocheck, lubridate::today()) %/% months(1) ``` --- ## Discretization of numeric variable according to quintile ```r Hmisc::cut2(datalist[1]$varnum, g =5) ``` --- ## Discretization of numeric variable according to fixed break for instance case size from integer to categoric ```r cut(datalist[1]$casesize, breaks = c(0, 1, 2, 3,5,30), labels = c("Case.size.1", "Case.size.2", "Case.size.3", "Case.size.4.5", "Case.size.6.or.more" ), include.lowest=TRUE) ``` --- ## Aggregate variable from nested frame (aka within repeat) to parent table ```r datalist[2] |> dplyr::select( members.sex, parent_index) |> tidyr::gather( parent_index, members.sex) |> dplyr::count(parent_index, members.sex) |> tidyr::spread(members.sex, n, fill = 0) |> dplyr::select( female) ``` --- class: inverse, center, middle # TIME TO PRACTISE ON YOUR OWN! ### .large[.white[
] **5 minutes! **]
−
+
10
:
00
Open again your expanded xlsfrom, set up the outliers treatment, clean the _"or_other"_ in the .large[clean] column. Then add calculated variables. Save and knit again your report! Do not hesitate to raise your questions in the [ticket system](https://github.com/Edouard-Legoupil/kobocruncher/issues/new) or in the chat so the training content can be improved accordingly! --- class: inverse, center, middle ### .large[.white[
] **Let's take a break! **]
−
+
05
:
00
__Next session__: [07-Anonymising In order to provide other people to work with the data, a first level of data anonymisation should be implemented](07-Anonymising.html)